Estimation of covariance matrices
In statistics, sometimes the covariance matrix of a multivariate random variable is not known but has to be estimated. Estimation of covariance matrices then deals with the question of how to approximate the actual covariance matrix on the basis of a sample from the multivariate distribution. Simple cases, where observations are complete, can be dealt with by using the sample covariance matrix. The sample covariance matrix (SCM) is an unbiased and efficient estimator of the covariance matrix if the space of covariance matrices is viewed as an extrinsic convex cone in R''p''×''p''; however, measured using the intrinsic geometry of positive-definite matrices, the SCM is a biased and inefficient estimator. In addition, if the random variable has a normal distribution, the sample covariance matrix has a Wishart distribution and a slightly differently scaled version of it is the maximum likelihood estimate. Cases involving missing data, heteroscedasticity, or autocorrelated residuals require deeper considerations. Another issue is the robustness to outliers, to which sample covariance matrices are highly sensitive.

Statistical analyses of multivariate data often involve exploratory studies of the way in which the variables change in relation to one another, and this may be followed up by explicit statistical models involving the covariance matrix of the variables. Thus the estimation of covariance matrices directly from observational data plays two roles:

* to provide initial estimates that can be used to study the inter-relationships;
* to provide sample estimates that can be used for model checking.

Estimates of covariance matrices are required at the initial stages of principal component analysis and factor analysis, and are also involved in versions of regression analysis that treat the dependent variables in a data-set, jointly with the independent variable, as the outcome of a random sample.


Estimation in a general context

Given a sample consisting of ''n'' independent observations ''x''1, ..., ''x''n of a ''p''-dimensional random vector ''X'' ∈ R''p''×1 (a ''p''×1 column-vector), an unbiased estimator of the (''p''×''p'') covariance matrix

:\operatorname{cov}(X) = \operatorname{E}\left[\left(X-\operatorname{E}[X]\right)\left(X-\operatorname{E}[X]\right)^\mathrm{T}\right]

is the sample covariance matrix

:\mathbf{Q} = \frac{1}{n-1} \sum_{i=1}^n (x_i-\overline{x})(x_i-\overline{x})^\mathrm{T},

where x_i is the ''i''-th observation of the ''p''-dimensional random vector, and the vector

:\overline{x} = \frac{1}{n} \sum_{i=1}^n x_i

is the sample mean. This is true regardless of the distribution of the random variable ''X'', provided of course that the theoretical means and covariances exist. The reason for the factor ''n'' − 1 rather than ''n'' is essentially the same as the reason for the same factor appearing in unbiased estimates of sample variances and sample covariances, which relates to the fact that the mean is not known and is replaced by the sample mean (see Bessel's correction).

In cases where the distribution of the random variable ''X'' is known to be within a certain family of distributions, other estimates may be derived on the basis of that assumption. A well-known instance is when the random variable ''X'' is normally distributed: in this case the maximum likelihood estimator of the covariance matrix is slightly different from the unbiased estimate, and is given by

:\frac{1}{n} \sum_{i=1}^n (x_i-\overline{x})(x_i-\overline{x})^\mathrm{T}.

A derivation of this result is given below. Clearly, the difference between the unbiased estimator and the maximum likelihood estimator diminishes for large ''n''.

In the general case, the unbiased estimate of the covariance matrix provides an acceptable estimate when the data vectors in the observed data set are all complete: that is, they contain no missing elements. One approach to estimating the covariance matrix is to treat the estimation of each variance or pairwise covariance separately, and to use all the observations for which both variables have valid values. Assuming the missing data are missing at random, this results in an estimate for the covariance matrix which is unbiased. However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. When estimating the cross-covariance of a pair of signals that are wide-sense stationary, missing samples do ''not'' need to be random (e.g., sub-sampling by an arbitrary factor is valid).
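As an illustration, the following NumPy sketch (not part of the original article; data and variable names are ours) computes the unbiased and maximum-likelihood estimates, and shows why pairwise-complete estimation under missing data can fail to be positive semi-definite:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# Unbiased sample covariance matrix: divisor n - 1 (Bessel's correction).
Q_unbiased = np.cov(X, rowvar=False)           # ddof = 1 is the default
# Maximum-likelihood estimate under normality: divisor n.
Q_ml = np.cov(X, rowvar=False, ddof=0)

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar)                  # scatter matrix
assert np.allclose(Q_unbiased, S / (n - 1))
assert np.allclose(Q_ml, S / n)

# Pairwise-complete estimation with values missing at random: each entry
# uses only the rows where both variables are observed.  The result is
# unbiased but need not be positive semi-definite.
X_miss = X.copy()
X_miss[rng.random((n, p)) < 0.3] = np.nan
C = np.empty((p, p))
for i in range(p):
    for j in range(p):
        ok = ~np.isnan(X_miss[:, i]) & ~np.isnan(X_miss[:, j])
        xi, xj = X_miss[ok, i], X_miss[ok, j]
        C[i, j] = ((xi - xi.mean()) * (xj - xj.mean())).sum() / (ok.sum() - 1)
print("smallest eigenvalue of pairwise estimate:", np.linalg.eigvalsh(C).min())
</syntaxhighlight>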


Maximum-likelihood estimation for the multivariate normal distribution

A random vector ''X'' ∈ R''p'' (a ''p''×1 "column vector") has a multivariate normal distribution with a nonsingular covariance matrix Σ precisely if Σ ∈ R''p''×''p'' is a positive-definite matrix and the probability density function of ''X'' is

:f(x) = (2\pi)^{-p/2}\, \det(\Sigma)^{-1/2} \exp\left(-\frac{1}{2} (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu)\right)

where ''μ'' ∈ R''p''×1 is the expected value of ''X''. The covariance matrix Σ is the multidimensional analog of what in one dimension would be the variance, and

:(2\pi)^{-p/2} \det(\Sigma)^{-1/2}

normalizes the density f(x) so that it integrates to 1.

Suppose now that ''X''1, ..., ''X''n are independent and identically distributed samples from the distribution above. Based on the observed values ''x''1, ..., ''x''n of this sample, we wish to estimate Σ.
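As a quick numerical check of this density formula (a sketch; the particular ''μ'' and Σ are arbitrary choices of ours), one can compare a direct evaluation against SciPy:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

p = 2
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                 # positive definite
x = np.array([0.3, 0.7])

# f(x) = (2*pi)^{-p/2} det(Sigma)^{-1/2} exp(-(x-mu)^T Sigma^{-1} (x-mu) / 2)
d = x - mu
f = ((2 * np.pi) ** (-p / 2)
     * np.linalg.det(Sigma) ** (-0.5)
     * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)))

assert np.isclose(f, multivariate_normal(mu, Sigma).pdf(x))
</syntaxhighlight>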


First steps

The likelihood function is:

: \mathcal{L}(\mu,\Sigma) = (2\pi)^{-np/2}\, \prod_{i=1}^n \det(\Sigma)^{-1/2} \exp\left(-\frac{1}{2} (x_i-\mu)^\mathrm{T} \Sigma^{-1} (x_i-\mu)\right)

It is fairly readily shown that the maximum-likelihood estimate of the mean vector ''μ'' is the "sample mean" vector:

:\overline{x} = \frac{x_1 + \cdots + x_n}{n}.

See the section on estimation in the article on the normal distribution for details; the process here is similar. Since the estimate \bar{x} does not depend on Σ, we can just substitute it for ''μ'' in the likelihood function, getting

: \mathcal{L}(\overline{x},\Sigma) \propto \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2} \sum_{i=1}^n (x_i-\overline{x})^\mathrm{T} \Sigma^{-1} (x_i-\overline{x})\right),

and then seek the value of Σ that maximizes the likelihood of the data (in practice it is easier to work with \log \mathcal{L}).
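The maximization can be illustrated numerically. The sketch below (our own helper function, using the scatter matrix S defined in the next subsection) evaluates \log \mathcal{L}(\overline{x},\Sigma) and confirms that the scaling S/n attains a higher likelihood than the unbiased choice S/(n − 1):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
D = X - X.mean(axis=0)
S = D.T @ D                                    # scatter matrix (next subsection)

def profile_loglik(Sigma):
    """log L(xbar, Sigma), dropping the constant -np/2 * log(2*pi)."""
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ij,jk,ik->', D, np.linalg.inv(Sigma), D)
    return -0.5 * n * logdet - 0.5 * quad

# The maximum-likelihood scaling S/n beats the unbiased scaling S/(n-1).
assert profile_loglik(S / n) > profile_loglik(S / (n - 1))
</syntaxhighlight>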


The trace of a 1 × 1 matrix

Now we come to the first surprising step: regard the scalar (x_i-\overline{x})^\mathrm{T} \Sigma^{-1} (x_i-\overline{x}) as the trace of a 1×1 matrix. This makes it possible to use the identity tr(''AB'') = tr(''BA'') whenever ''A'' and ''B'' are matrices so shaped that both products exist. We get

:\begin{align} \mathcal{L}(\overline{x},\Sigma) &\propto \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2} \sum_{i=1}^n \left(\left(x_i-\overline{x}\right)^\mathrm{T} \Sigma^{-1} \left(x_i-\overline{x}\right)\right)\right) \\ &= \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2} \sum_{i=1}^n \operatorname{tr}\left(\left(x_i-\overline{x}\right)\left(x_i-\overline{x}\right)^\mathrm{T} \Sigma^{-1}\right)\right) \\ &= \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2} \operatorname{tr}\left(\sum_{i=1}^n \left(x_i-\overline{x}\right)\left(x_i-\overline{x}\right)^\mathrm{T} \Sigma^{-1}\right)\right) \\ &= \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2} \operatorname{tr}\left(S \Sigma^{-1}\right)\right) \end{align}

where

:S = \sum_{i=1}^n (x_i-\overline{x})(x_i-\overline{x})^\mathrm{T} \in \mathbf{R}^{p \times p}.

S is sometimes called the scatter matrix, and is positive definite if there exists a subset of the data consisting of p affinely independent observations (which we will assume).
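The cyclic-trace identity used here is easy to verify numerically (a small sketch with arbitrary test data of ours):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 4
X = rng.standard_normal((n, p))
D = X - X.mean(axis=0)
S = D.T @ D                                    # scatter matrix

A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)                # an arbitrary positive-definite Sigma
Sigma_inv = np.linalg.inv(Sigma)

# Each quadratic form is the trace of a 1x1 matrix, so by tr(AB) = tr(BA)
# the sum collapses to tr(S Sigma^{-1}).
lhs = sum(d @ Sigma_inv @ d for d in D)
rhs = np.trace(S @ Sigma_inv)
assert np.isclose(lhs, rhs)
</syntaxhighlight>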


Using the spectral theorem

It follows from the spectral theorem of linear algebra that a positive-definite symmetric matrix ''S'' has a unique positive-definite symmetric square root ''S''1/2. We can again use the "cyclic property" of the trace to write

:\det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2} \operatorname{tr}\left(S^{1/2} \Sigma^{-1} S^{1/2}\right)\right).

Let ''B'' = ''S''1/2 Σ−1 ''S''1/2. Then the expression above becomes

:\det(S)^{-n/2} \det(B)^{n/2} \exp\left(-\frac{1}{2} \operatorname{tr}(B)\right).

The positive-definite matrix ''B'' can be diagonalized, and the problem of finding the value of ''B'' that maximizes

:\det(B)^{n/2} \exp\left(-\frac{1}{2} \operatorname{tr}(B)\right)

then reduces, since the trace of a square matrix equals the sum of its eigenvalues, to the problem of finding the eigenvalues λ1, ..., λ''p'' that maximize

:\prod_{i=1}^p \lambda_i^{n/2} \exp\left(-\frac{\lambda_i}{2}\right).

This is just a calculus problem and we get λ''i'' = ''n'' for all ''i''. Thus, if ''Q'' is the matrix of eigenvectors, then

:B = Q (n I_p) Q^{-1} = n I_p,

i.e., ''n'' times the ''p''×''p'' identity matrix.
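Both steps can be checked numerically: at Σ = S/n the matrix B = S1/2 Σ−1 S1/2 is exactly n·I_p, and the scalar objective λ^{n/2} e^{−λ/2} peaks at λ = n (a sketch with synthetic data of ours):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 3
D = rng.standard_normal((n, p))
D -= D.mean(axis=0)
S = D.T @ D

# Symmetric square root of S via its eigendecomposition.
w, V = np.linalg.eigh(S)
S_half = V @ np.diag(np.sqrt(w)) @ V.T

# At Sigma = S/n, B = S^{1/2} Sigma^{-1} S^{1/2} equals n * I_p.
B = S_half @ np.linalg.inv(S / n) @ S_half
assert np.allclose(B, n * np.eye(p))

# Scalar check: log(lambda^{n/2} e^{-lambda/2}) is maximized at lambda = n.
lam = np.linspace(1.0, 3.0 * n, 2001)
f = 0.5 * n * np.log(lam) - 0.5 * lam
assert abs(lam[np.argmax(f)] - n) < 0.1
</syntaxhighlight>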


Concluding steps

Finally we get

:\Sigma = S^{1/2} B^{-1} S^{1/2} = S^{1/2} \left(\frac{1}{n} I_p\right) S^{1/2} = \frac{S}{n},

i.e., the ''p''×''p'' "sample covariance matrix"

:\frac{S}{n} = \frac{1}{n} \sum_{i=1}^n (X_i-\overline{X})(X_i-\overline{X})^\mathrm{T}

is the maximum-likelihood estimator of the "population covariance matrix" Σ. At this point we are using a capital ''X'' rather than a lower-case ''x'' because we are thinking of it "as an estimator rather than as an estimate", i.e., as something random whose probability distribution we could profit by knowing. The random matrix ''S'' can be shown to have a Wishart distribution with ''n'' − 1 degrees of freedom. That is:

:\sum_{i=1}^n (X_i-\overline{X})(X_i-\overline{X})^\mathrm{T} \sim W_p(\Sigma, n-1).
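A Monte Carlo sketch (with an arbitrary Σ of ours) illustrates the Wishart result through its first moment, E[S] = (n − 1)Σ:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
n, p, reps = 10, 2, 20000
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])

S_mean = np.zeros((p, p))
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    D = X - X.mean(axis=0)
    S_mean += D.T @ D
S_mean /= reps

# A W_p(Sigma, n-1) matrix has mean (n-1) * Sigma.
assert np.allclose(S_mean, (n - 1) * Sigma, atol=0.2)
</syntaxhighlight>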


Alternative derivation

An alternative derivation of the maximum likelihood estimator can be performed via matrix calculus formulae (see also differential of a determinant and differential of the inverse matrix). It also verifies the aforementioned fact about the maximum likelihood estimate of the mean. Re-write the likelihood in the log form using the trace trick:

:\ln \mathcal{L}(\mu,\Sigma) = \operatorname{const} - \frac{n}{2} \ln \det(\Sigma) - \frac{1}{2} \operatorname{tr}\left[\Sigma^{-1} \sum_{i=1}^n (x_i-\mu)(x_i-\mu)^\mathrm{T}\right]

The differential of this log-likelihood is

:d \ln \mathcal{L}(\mu,\Sigma) = -\frac{n}{2} \operatorname{tr}\left[\Sigma^{-1} \left\{d\Sigma\right\}\right] - \frac{1}{2} \operatorname{tr}\left[-\Sigma^{-1} \left\{d\Sigma\right\} \Sigma^{-1} \sum_{i=1}^n (x_i-\mu)(x_i-\mu)^\mathrm{T} - 2\Sigma^{-1} \sum_{i=1}^n (x_i-\mu)\left\{d\mu\right\}^\mathrm{T}\right]

It naturally breaks down into the part related to the estimation of the mean, and the part related to the estimation of the variance. The first order condition for a maximum, d \ln \mathcal{L}(\mu,\Sigma) = 0, is satisfied when the terms multiplying d\mu and d\Sigma are identically zero. Assuming (the maximum likelihood estimate of) \Sigma is non-singular, the first order condition for the estimate of the mean vector is

:\sum_{i=1}^n (x_i - \mu) = 0,

which leads to the maximum likelihood estimator

:\widehat{\mu} = \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i.

This lets us simplify

:\sum_{i=1}^n (x_i-\mu)(x_i-\mu)^\mathrm{T} = \sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^\mathrm{T} = S

as defined above. Then the terms involving d\Sigma in d \ln \mathcal{L} can be combined as

:-\frac{1}{2} \operatorname{tr}\left(\Sigma^{-1} \left\{d\Sigma\right\} \left[n I_p - \Sigma^{-1} S\right]\right).

The first order condition d \ln \mathcal{L}(\mu,\Sigma) = 0 will hold when the term in the square brackets is (matrix-valued) zero. Pre-multiplying the latter by \Sigma and dividing by n gives

:\widehat{\Sigma} = \frac{1}{n} S,

which of course coincides with the canonical derivation given earlier. Dwyer points out that decomposition into two terms such as appears above is "unnecessary" and derives the estimator in two lines of working. Note that it may not be trivial to show that the estimator so derived is the unique global maximizer of the likelihood function.
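The first-order condition can also be verified numerically. The differential above implies that, for fixed μ = \bar{x}, the gradient of the log-likelihood with respect to Σ is -\frac{n}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1} S \Sigma^{-1}; the sketch below (our own code) checks that it vanishes at \widehat{\Sigma} = S/n but not at S/(n − 1):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(5)
n, p = 25, 3
X = rng.standard_normal((n, p))
D = X - X.mean(axis=0)
S = D.T @ D

def grad_sigma(Sigma):
    """Gradient of ln L w.r.t. Sigma (mu fixed at the sample mean)."""
    Si = np.linalg.inv(Sigma)
    return -0.5 * n * Si + 0.5 * Si @ S @ Si

# The gradient vanishes at the maximum likelihood estimate S/n ...
assert np.allclose(grad_sigma(S / n), 0.0, atol=1e-8)
# ... but not at the unbiased estimate S/(n-1).
assert not np.allclose(grad_sigma(S / (n - 1)), 0.0, atol=1e-8)
</syntaxhighlight>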


Intrinsic covariance matrix estimation


Intrinsic expectation

Given a sample of ''n'' independent observations ''x''1, ..., ''x''n of a ''p''-dimensional zero-mean Gaussian random variable ''X'' with covariance R, the maximum likelihood estimator of R is given by

:\hat{\mathbf{R}} = \frac{1}{n} \sum_{i=1}^n x_i x_i^\mathrm{T}.

The parameter R belongs to the set of positive-definite matrices, which is a Riemannian manifold, not a vector space, hence the usual vector-space notions of expectation, i.e. "\mathrm{E}[\hat{\mathbf{R}}]", and estimator bias must be generalized to manifolds to make sense of the problem of covariance matrix estimation. This can be done by defining the expectation of a manifold-valued estimator \hat{\mathbf{R}} with respect to the manifold-valued point R as

:\mathrm{E}_\mathbf{R}\left[\hat{\mathbf{R}}\right] \ \stackrel{\mathrm{def}}{=}\ \exp_\mathbf{R}\mathrm{E}\left[\exp_\mathbf{R}^{-1}\hat{\mathbf{R}}\right]

where

:\exp_\mathbf{R}(\hat{\mathbf{R}}) = \mathbf{R}^{1/2}\exp\left(\mathbf{R}^{-1/2}\hat{\mathbf{R}}\mathbf{R}^{-1/2}\right)\mathbf{R}^{1/2}
:\exp_\mathbf{R}^{-1}(\hat{\mathbf{R}}) = \mathbf{R}^{1/2}\left(\log\mathbf{R}^{-1/2}\hat{\mathbf{R}}\mathbf{R}^{-1/2}\right)\mathbf{R}^{1/2}

are the exponential map and inverse exponential map, respectively, "exp" and "log" denote the ordinary matrix exponential and matrix logarithm, and E[·] is the ordinary expectation operator defined on a vector space, in this case the tangent space of the manifold.
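A sketch of these maps (our own implementation, using SciPy's matrix exponential and logarithm; SPD square roots are taken by eigendecomposition) verifies that they are mutual inverses:

<syntaxhighlight lang="python">
import numpy as np
from scipy.linalg import expm, logm

def spd_power(M, a):
    """Fractional power of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** a) @ V.T

def exp_map(R, T):
    """exp_R(T): map a tangent vector T at R to a point on the SPD manifold."""
    Rh, Rmh = spd_power(R, 0.5), spd_power(R, -0.5)
    return Rh @ expm(Rmh @ T @ Rmh) @ Rh

def log_map(R, Q):
    """exp_R^{-1}(Q): map a point Q back to the tangent space at R."""
    Rh, Rmh = spd_power(R, 0.5), spd_power(R, -0.5)
    return Rh @ logm(Rmh @ Q @ Rmh) @ Rh

R = np.array([[2.0, 0.3],
              [0.3, 1.0]])
Q = np.array([[1.5, -0.2],
              [-0.2, 0.8]])
# Round trip: exp_R(exp_R^{-1}(Q)) recovers Q.
assert np.allclose(exp_map(R, log_map(R, Q)), Q)
</syntaxhighlight>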


Bias of the sample covariance matrix

The intrinsic bias vector field of the SCM estimator \hat{\mathbf{R}} is defined to be

:\mathbf{B}(\hat{\mathbf{R}}) = \exp_\mathbf{R}^{-1}\mathrm{E}_\mathbf{R}\left[\hat{\mathbf{R}}\right] = \mathrm{E}\left[\exp_\mathbf{R}^{-1}\hat{\mathbf{R}}\right]

The intrinsic estimator bias is then given by \exp_\mathbf{R}\mathbf{B}(\hat{\mathbf{R}}). For complex Gaussian random variables, this bias vector field can be shown to equal

:\mathbf{B}(\hat{\mathbf{R}}) = -\beta(p,n)\mathbf{R}

where

:\beta(p,n) = \frac{1}{p}\left(p\log n + p - \psi(n-p+1) + (n-p+1)\psi(n-p+2) + \psi(n+1) - (n+1)\psi(n+2)\right)

and ψ(·) is the digamma function. The intrinsic bias of the sample covariance matrix equals

:\exp_\mathbf{R}\mathbf{B}(\hat{\mathbf{R}}) = e^{-\beta(p,n)}\mathbf{R}

and the SCM is asymptotically unbiased as ''n'' → ∞. Similarly, the intrinsic inefficiency of the sample covariance matrix depends upon the Riemannian curvature of the space of positive-definite matrices.


Shrinkage estimation

If the sample size ''n'' is small and the number of considered variables ''p'' is large, the above empirical estimators of covariance and correlation are very unstable. Specifically, it is possible to furnish estimators that improve considerably upon the maximum likelihood estimate in terms of mean squared error. Moreover, for ''n'' < ''p'' (the number of observations is less than the number of random variables) the empirical estimate of the covariance matrix becomes singular, i.e. it cannot be inverted to compute the precision matrix.

As an alternative, many methods have been suggested to improve the estimation of the covariance matrix. All of these approaches rely on the concept of shrinkage. This is implicit in Bayesian methods and in penalized maximum likelihood methods, and explicit in the Stein-type shrinkage approach. A simple version of a shrinkage estimator of the covariance matrix is represented by the Ledoit-Wolf shrinkage estimator [O. Ledoit and M. Wolf (2004a) "A well-conditioned estimator for large-dimensional covariance matrices", ''Journal of Multivariate Analysis'' 88 (2): 365–411; A. Touloumis (2015) "Nonparametric Stein-type shrinkage covariance matrix estimators in high-dimensional settings", ''Computational Statistics & Data Analysis'' 83: 251–261; O. Ledoit and M. Wolf (2003) "Improved estimation of the covariance matrix of stock returns with an application to portfolio selection", ''Journal of Empirical Finance'' 10 (5): 603–621; O. Ledoit and M. Wolf (2004b) "Honey, I shrunk the sample covariance matrix", ''The Journal of Portfolio Management'' 30 (4): 110–119].
One considers a convex combination of the empirical estimator (A) with some suitably chosen target (B), e.g., the diagonal matrix. Subsequently, the mixing parameter (\delta) is selected to maximize the expected accuracy of the shrunken estimator. This can be done by cross-validation, or by using an analytic estimate of the shrinkage intensity. The resulting regularized estimator (\delta A + (1 - \delta) B) can be shown to outperform the maximum likelihood estimator for small samples. For large samples, the shrinkage intensity will reduce to zero, hence in this case the shrinkage estimator will be identical to the empirical estimator. Apart from increased efficiency, the shrinkage estimate has the additional advantage that it is always positive definite and well conditioned.

Various shrinkage targets have been proposed:

# the identity matrix, scaled by the average sample variance;
# the single-index model;
# the constant-correlation model, where the sample variances are preserved, but all pairwise correlation coefficients are assumed to be equal to one another;
# the two-parameter matrix, where all variances are identical, and all covariances are identical to one another (although ''not'' identical to the variances);
# the diagonal matrix containing sample variances on the diagonal and zeros everywhere else;
# the identity matrix.

The shrinkage estimator can be generalized to a multi-target shrinkage estimator that utilizes several targets simultaneously. Software for computing a covariance shrinkage estimator is available in R (packages corpcor and ShrinkCovMat), in Python (the scikit-learn library), and in MATLAB (code is available for the scaled-identity, single-index-model, constant-correlation-model, two-parameter-matrix, and diagonal-matrix targets).
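For example, scikit-learn's LedoitWolf estimator implements analytic shrinkage towards a scaled-identity target; a minimal usage sketch (synthetic data of ours, with n < p):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.covariance import EmpiricalCovariance, LedoitWolf

rng = np.random.default_rng(6)
n, p = 20, 50                      # fewer observations than variables
X = rng.standard_normal((n, p))

emp = EmpiricalCovariance().fit(X)
lw = LedoitWolf().fit(X)

print("shrinkage intensity delta:", lw.shrinkage_)
# The empirical estimate is singular when n < p ...
print("empirical rank:", np.linalg.matrix_rank(emp.covariance_), "of", p)
# ... while the shrunk estimate is positive definite and invertible.
print("smallest eigenvalue (shrunk):", np.linalg.eigvalsh(lw.covariance_).min())
</syntaxhighlight>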


See also

* Propagation of uncertainty
* Sample mean and sample covariance
* Variance components

